{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python实现批量英语单词翻译"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***用途：***   \n",
    "对批量的英语文本，生成英语-汉语翻译的单词本，提供Excel下载\n",
    "\n",
    "***本代码实现：***\n",
    "1. 提供一个英文文章URL，自动下载网页；\n",
    "2. 实现网页中所有英语单词的翻译；\n",
    "3. 下载翻译结果的Excel\n",
    "\n",
    "***涉及技术：***\n",
    "1. pandas的读取csv、多数据merge、输出Excel\n",
    "2. requests库下载HTML网页\n",
    "3. BeautifulSoup解析HTML网页\n",
    "4. Python正则表达式实现英文分词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. 读取英语-汉语翻译词典文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "词典文件来自：https://github.com/skywind3000/ECDICT\n",
    "使用步骤：\n",
    "1. 下载代码打包：https://github.com/skywind3000/ECDICT/archive/master.zip\n",
    "2. 解压master.zip，然后解压其中的‪stardict.csv文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "d:\\Anaconda3\\lib\\site-packages\\IPython\\core\\interactiveshell.py:3063: DtypeWarning: Columns (11) have mixed types.Specify dtype option on import or set low_memory=False.\n",
      "  interactivity=interactivity, compiler=compiler, result=result)\n"
     ]
    }
   ],
   "source": [
    "# 注意：stardict.csv的地址需要替换成你自己的文件地址\n",
    "df_dict = pd.read_csv(\"D:/tmp/ECDICT-master/stardict.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3402564, 13)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dict.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "      <th>phonetic</th>\n",
       "      <th>definition</th>\n",
       "      <th>translation</th>\n",
       "      <th>pos</th>\n",
       "      <th>collins</th>\n",
       "      <th>oxford</th>\n",
       "      <th>tag</th>\n",
       "      <th>bnc</th>\n",
       "      <th>frq</th>\n",
       "      <th>exchange</th>\n",
       "      <th>detail</th>\n",
       "      <th>audio</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1267018</th>\n",
       "      <td>golden maidenhair</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[网络] 金色的铁线</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>s:golden maidenhairs</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>951520</th>\n",
       "      <td>electrical element</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[电] 电元件</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>740373</th>\n",
       "      <td>cyolite</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[网络] 花岗岩</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>989569</th>\n",
       "      <td>entomological</td>\n",
       "      <td>.entәmә'lɒdʒikәl</td>\n",
       "      <td>a. of or relating to the biological science of...</td>\n",
       "      <td>a. 昆虫学的\\n[医] 昆虫学的</td>\n",
       "      <td>j:100</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>44646.0</td>\n",
       "      <td>39435.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2722418</th>\n",
       "      <td>servomotor control</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>[自] 伺服电动机控制</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       word          phonetic  \\\n",
       "1267018   golden maidenhair               NaN   \n",
       "951520   electrical element               NaN   \n",
       "740373              cyolite               NaN   \n",
       "989569        entomological  .entәmә'lɒdʒikәl   \n",
       "2722418  servomotor control               NaN   \n",
       "\n",
       "                                                definition        translation  \\\n",
       "1267018                                                NaN         [网络] 金色的铁线   \n",
       "951520                                                 NaN            [电] 电元件   \n",
       "740373                                                 NaN           [网络] 花岗岩   \n",
       "989569   a. of or relating to the biological science of...  a. 昆虫学的\\n[医] 昆虫学的   \n",
       "2722418                                                NaN        [自] 伺服电动机控制   \n",
       "\n",
       "           pos  collins  oxford  tag      bnc      frq              exchange  \\\n",
       "1267018    NaN      NaN     NaN  NaN      0.0      0.0  s:golden maidenhairs   \n",
       "951520     NaN      NaN     NaN  NaN      0.0      0.0                   NaN   \n",
       "740373     NaN      NaN     NaN  NaN      NaN      NaN                   NaN   \n",
       "989569   j:100      NaN     NaN  NaN  44646.0  39435.0                   NaN   \n",
       "2722418    NaN      NaN     NaN  NaN      NaN      NaN                   NaN   \n",
       "\n",
       "        detail  audio  \n",
       "1267018    NaN    NaN  \n",
       "951520     NaN    NaN  \n",
       "740373     NaN    NaN  \n",
       "989569     NaN    NaN  \n",
       "2722418    NaN    NaN  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_dict.sample(10).head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "      <th>translation</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>'a</td>\n",
       "      <td>na. 一\\nn. 英文字母表的第一字母；【乐】A音\\nart. 冠以不定冠词主要表示类别\\...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>'A' game</td>\n",
       "      <td>[网络] 游戏；一个游戏；一局</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>'Abbāsīyah</td>\n",
       "      <td>[地名] 阿巴西耶 ( 埃 )</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>'Abd al Kūrī</td>\n",
       "      <td>[地名] 阿卜杜勒库里岛 ( 也门 )</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>'Abd al Mājid</td>\n",
       "      <td>[地名] 阿卜杜勒马吉德 ( 苏丹 )</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            word                                        translation\n",
       "0             'a  na. 一\\nn. 英文字母表的第一字母；【乐】A音\\nart. 冠以不定冠词主要表示类别\\...\n",
       "1       'A' game                                    [网络] 游戏；一个游戏；一局\n",
       "2     'Abbāsīyah                                    [地名] 阿巴西耶 ( 埃 )\n",
       "3   'Abd al Kūrī                                [地名] 阿卜杜勒库里岛 ( 也门 )\n",
       "4  'Abd al Mājid                                [地名] 阿卜杜勒马吉德 ( 苏丹 )"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 把word、translation之外的列扔掉\n",
    "df_dict = df_dict[[\"word\", \"translation\"]]\n",
    "df_dict.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. 下载网页，得到网页内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pandas官方文档中的一个URL\n",
    "url = \"https://pandas.pydata.org/docs/user_guide/indexing.html\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "html_cont = requests.get(url).text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n\\n<!DOCTYPE html>\\n\\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\\n  <head>\\n    <meta charset=\"utf-8\" />'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "html_cont[:100]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. 提取HTML的正文内容\n",
    "即：去除HTML标签，获取正文"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "soup = BeautifulSoup(html_cont)\n",
    "html_text = soup.get_text()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n\\n\\nIndexing and selecting data — pandas 1.0.1 documentation\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nMathJax.Hub.Config({\"tex2jax\": {\"inlineMath\": [[\"$\", \"$\"], [\"\\\\\\\\(\", \"\\\\\\\\)\"]], \"processEscapes\": true, \"ignoreClass\": \"document\", \"processClass\": \"math|output_area\"}})\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nHome\\n\\n\\nWhat\\'s New in 1.0.0\\n\\n\\nGetting started\\n\\n\\nUser Guide\\n\\n\\nAPI reference\\n\\n\\nDevelopment\\n\\n\\nRelease Notes\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nIO tools (text, CSV, HDF5, â\\x80¦)\\n\\n\\nIndexing and selecting data\\n\\n\\nMultiIndex / advanced indexing\\n\\n\\nMerge, join, a'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "html_text[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. 英文分词和数据清洗"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['', '', '', 'Indexing', 'and', 'selecting', 'data', '—', 'pandas', '1']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 分词\n",
    "import re\n",
    "word_list = re.split(\"\"\"[ ,.\\(\\)/\\n|\\-:=\\$\\[\"']\"\"\",html_text)\n",
    "word_list[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['faces',\n",
       " '',\n",
       " 'says',\n",
       " 'consequently',\n",
       " 'kind',\n",
       " 'known',\n",
       " \"here's\",\n",
       " 'inc',\n",
       " 'out',\n",
       " 'man']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 读取停用词表，从网上复制的，位于当前目录下\n",
    "with open(\"./datas/stop_words/stop_words.txt\") as fin:\n",
    "    stop_words=set(fin.read().split(\"\\n\"))\n",
    "list(stop_words)[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['indexing',\n",
       " 'selecting',\n",
       " 'data',\n",
       " 'pandas',\n",
       " 'documentation',\n",
       " 'mathjax',\n",
       " 'hub',\n",
       " 'config',\n",
       " 'tex2jax',\n",
       " 'inlinemath',\n",
       " '\\\\\\\\',\n",
       " '\\\\\\\\',\n",
       " ']]',\n",
       " 'processescapes',\n",
       " 'true',\n",
       " 'ignoreclass',\n",
       " 'document',\n",
       " 'processclass',\n",
       " 'math',\n",
       " 'output_area']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 数据清洗\n",
    "word_list_clean = []\n",
    "for word in word_list:\n",
    "    word = str(word).lower().strip()\n",
    "    # 过滤掉空词、数字、单个字符的词、停用词\n",
    "    if not word or word.isnumeric() or len(word)<=1 or word in stop_words:\n",
    "        continue\n",
    "    word_list_clean.append(word)\n",
    "word_list_clean[:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. 分词结果构造成一个DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>indexing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>selecting</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>data</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>pandas</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>documentation</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            word\n",
       "0       indexing\n",
       "1      selecting\n",
       "2           data\n",
       "3         pandas\n",
       "4  documentation"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_words = pd.DataFrame({\n",
    "    \"word\": word_list_clean\n",
    "})\n",
    "df_words.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4915, 1)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_words.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>620</th>\n",
       "      <td>df</td>\n",
       "      <td>161</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>659</th>\n",
       "      <td>dtype</td>\n",
       "      <td>87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1274</th>\n",
       "      <td>true</td>\n",
       "      <td>86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>593</th>\n",
       "      <td>dataframe</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1038</th>\n",
       "      <td>pd</td>\n",
       "      <td>75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>917</th>\n",
       "      <td>loc</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>970</th>\n",
       "      <td>nan</td>\n",
       "      <td>72</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>721</th>\n",
       "      <td>false</td>\n",
       "      <td>58</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>914</th>\n",
       "      <td>list</td>\n",
       "      <td>58</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>835</th>\n",
       "      <td>indexing</td>\n",
       "      <td>53</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           word  count\n",
       "620          df    161\n",
       "659       dtype     87\n",
       "1274       true     86\n",
       "593   dataframe     80\n",
       "1038         pd     75\n",
       "917         loc     72\n",
       "970         nan     72\n",
       "721       false     58\n",
       "914        list     58\n",
       "835    indexing     53"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 统计词频\n",
    "df_words = (\n",
    "    df_words\n",
    "    .groupby(\"word\")[\"word\"]\n",
    "    .agg(count=\"size\")\n",
    "    .reset_index()\n",
    "    .sort_values(by=\"count\", ascending=False)\n",
    ")\n",
    "df_words.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. 和单词词典实现merge"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_merge = pd.merge(\n",
    "    left = df_dict,\n",
    "    right = df_words,\n",
    "    left_on = \"word\",\n",
    "    right_on = \"word\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>word</th>\n",
       "      <th>translation</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>510</th>\n",
       "      <td>previous</td>\n",
       "      <td>a. 早先的, 前面的, 过急的\\n[法] 以前的, 生前的, 前述的</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>183</th>\n",
       "      <td>desired</td>\n",
       "      <td>a. 渴望的；想得到的</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>499</th>\n",
       "      <td>portion</td>\n",
       "      <td>n. 部分, 一份, 命运, 嫁妆\\nvt. 分配, 给...嫁妆</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>130</th>\n",
       "      <td>computational</td>\n",
       "      <td>a. 计算的</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>347</th>\n",
       "      <td>inform</td>\n",
       "      <td>vt. 通知, 使了解, 使充满\\nvi. 提供资料, 告发</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>452</th>\n",
       "      <td>nullable</td>\n",
       "      <td>[计] 可空的</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>448</th>\n",
       "      <td>normalized</td>\n",
       "      <td>a. 规格化的；标准化的；正规化的</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>225</th>\n",
       "      <td>enhancing</td>\n",
       "      <td>v. 提高( enhance的现在分词 ); 增进; 用计算机增强（照片等）; 提高…的价值...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>257</th>\n",
       "      <td>expr</td>\n",
       "      <td>abbr. 表达式</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>437</th>\n",
       "      <td>multiset</td>\n",
       "      <td>[计] 多重集</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              word                                        translation  count\n",
       "510       previous                a. 早先的, 前面的, 过急的\\n[法] 以前的, 生前的, 前述的      3\n",
       "183        desired                                        a. 渴望的；想得到的      3\n",
       "499        portion                  n. 部分, 一份, 命运, 嫁妆\\nvt. 分配, 给...嫁妆      1\n",
       "130  computational                                             a. 计算的      1\n",
       "347         inform                     vt. 通知, 使了解, 使充满\\nvi. 提供资料, 告发      1\n",
       "452       nullable                                            [计] 可空的      2\n",
       "448     normalized                                  a. 规格化的；标准化的；正规化的      2\n",
       "225      enhancing  v. 提高( enhance的现在分词 ); 增进; 用计算机增强（照片等）; 提高…的价值...      1\n",
       "257           expr                                          abbr. 表达式      2\n",
       "437       multiset                                            [计] 多重集      1"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merge.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(718, 3)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merge.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7. 存入Excel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_merge.to_excel(\"./38. batch_chinese_english.xlsx\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 后续升级：\n",
    "1. 可以提供txt/excel/word/pdf的批量输入，生成单词本；\n",
    "2. 可以做成网页、微信小程序的形式，在线访问和使用\n",
    "3. 用户可以标记或上传“已经认识的词语”，每次过滤掉"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
